Skip to content

refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330

Open
seedspirit wants to merge 6 commits into
mainfrom
refactor/BA-5744
Open

refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330
seedspirit wants to merge 6 commits into
mainfrom
refactor/BA-5744

Conversation

@seedspirit
Copy link
Copy Markdown
Contributor

@seedspirit seedspirit commented Apr 27, 2026

Summary

  • KernelNode.batch_load_live_stat (Valkey → valkey_stat.get_session_statistics_batch) is replaced by _batch_load_kernel_live_stat, which calls the metric processor and adapts the result through a new LegacyLiveStatConverter. Wire shape (dict[metric_name, MetricValue]) is preserved so GQL/WebUI consumers stay compatible.
  • MetricValue / MovingStatValue move from common/types.py to common/metrics/types.py next to the new RATE_STAT_METRICS / DIFF_STAT_METRICS classifications and resolve_unit_hint() helper (with naming-convention fallback so plugin metrics still get a usable unit hint).
  • New: LegacyLiveStatConverter unit tests covering gauge / rate / diff / capacity-default / pct-derivation / unknown-metric / multi-kernel isolation.

Known wire-level gaps

The remaining gaps are addressed by follow-up PRs that wire additional sources into the same converter; this PR's converter will pick them up as those land.

Field Status
current / pct (with capacity) / unit_hint / stats.diff / stats.rate ✅ legacy-equivalent
capacity for cpu_used / net_* / io_* (and dependent pct) 🔜 sentinel-synthesized in follow-up #11535 (BA-5806). The converter consumes the CAPACITY sample as-is, so it picks up the synthesized value automatically once #11535 lands.
stats.max (all metrics) / stats.avg (cpu_util, cuda_util, plugin accel metrics) 🔜 re-supplied from Prometheus in follow-up #11360 (BA-5878). Plugin metrics are covered by Backend.AI accelerator naming convention (*_mem / *_util / *_power / *_temperature), so cuda_* and other accelerator families get stats.max / stats.avg for free. Converter wiring (mapping new value_type=max / value_type=avg samples into the legacy stats.max / stats.avg slots) is the remaining piece on top of #11360.
Non-suffix-conforming new accelerator metric kinds (e.g., hypothetical *_clock / *_voltage) ⚠️ requires extending the accel-suffix list in #11360 (single-edit extension point: _ACCEL_GAUGE_SUFFIXES_* in common/clients/prometheus/metric_types.py).

Test plan

  • pants test tests/unit/manager/api/gql_legacy/test_stat_converter.py
  • pants check on the modified files
  • pants fmt / pants lint clean
  • A/B equivalence run with scripts/test-live-stat-equivalence.sh against a live session (stash → A label, pop → B label, diff live-stat-eqv/A vs live-stat-eqv/B)
  • WebUI smoke (kernel/session detail page → live_stat panel)

🤖 Generated with Claude Code

@github-actions github-actions Bot added the size:XL 500~ LoC label Apr 27, 2026
@github-actions github-actions Bot added area:docs Documentations comp:manager Related to Manager component comp:agent Related to Agent component comp:client Related to Client component comp:common Related to Common component comp:app-proxy Related to App Proxy component labels Apr 27, 2026
seedspirit added a commit that referenced this pull request Apr 27, 2026
@seedspirit seedspirit marked this pull request as ready for review April 27, 2026 04:06
Copilot AI review requested due to automatic review settings April 27, 2026 04:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Migrates the legacy GraphQL live_stat resolver path from Valkey-based session statistics to Prometheus-backed container live-stat queries, while preserving the existing dict[metric_name, MetricValue] wire shape for WebUI/GQL consumers.

Changes:

  • Replaced KernelNode.batch_load_live_stat Valkey calls with a Prometheus-based batch loader (_batch_load_kernel_live_stat) and a new LegacyLiveStatConverter.
  • Moved legacy MetricValue / MovingStatValue into ai.backend.common.metrics.types, adding metric classifications and resolve_unit_hint() for unit derivation.
  • Added unit tests covering conversion behavior across gauge/rate/diff metrics, pct derivation, defaults, and multi-kernel isolation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/unit/manager/api/gql_legacy/test_stat_converter.py Adds unit tests for the legacy live_stat converter behavior.
src/ai/backend/manager/repositories/metric/repository.py Minor refactor of Prometheus query result unpacking for live stats.
src/ai/backend/manager/clients/prometheus/fixed_query_builder.py Removes unreachable guard in template selection for metric types.
src/ai/backend/manager/api/gql_legacy/statistics.py Removes an unused/legacy batch loader wrapper method.
src/ai/backend/manager/api/gql_legacy/stat_converter.py Introduces LegacyLiveStatConverter to adapt Prometheus results to legacy shape.
src/ai/backend/manager/api/gql_legacy/kernel.py Switches live_stat resolver to Prometheus action + legacy conversion via dataloader.
src/ai/backend/common/types.py Removes legacy MetricValue/MovingStatValue from the common types module.
src/ai/backend/common/metrics/types.py Adds MetricValue/MovingStatValue + unit hint resolution and metric classifications.
src/ai/backend/common/clients/valkey_client/valkey_stat/client.py Updates imports to use the moved MetricValue type.
src/ai/backend/client/output/formatters.py Updates imports to use the moved MetricValue type.
src/ai/backend/appproxy/worker/types.py Updates imports to use the moved MetricValue/MovingStatValue types.
src/ai/backend/agent/stats.py Updates imports to use the moved MetricValue/MovingStatValue types.
changes/11330.enhance.md Adds changelog entry for the migration.
Comments suppressed due to low confidence (1)

src/ai/backend/manager/clients/prometheus/fixed_query_builder.py:154

  • _get_template() no longer has a fallback branch. If an unexpected/new MetricType value ever reaches this function (e.g., enum extended in the future), Python will implicitly return None, which then propagates into MetricPreset(template=...) and fails later with a less clear error. Please add an explicit default case that raises (e.g., UnreachableError/ValueError) so failures are immediate and actionable.
    def _get_template(self, metric_type: MetricType) -> str:
        match metric_type:
            case MetricType.GAUGE:
                return _GAUGE_TEMPLATE
            case MetricType.RATE:
                return _RATE_TEMPLATE
            case MetricType.DIFF:
                return _DIFF_TEMPLATE


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +16 to +31
MetricValue = TypedDict(
"MetricValue",
{
"current": str,
"capacity": str,
"pct": str,
"unit_hint": str,
"stats.min": str,
"stats.max": str,
"stats.sum": str,
"stats.avg": str,
"stats.diff": str,
"stats.rate": str,
"stats.version": int | None,
},
)
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricValue now defines capacity as a required str, but there are still in-repo producers/consumers that treat capacity as nullable (e.g., alembic stats migration sets "capacity": None, and the CLI formatter checks metric["capacity"] is not None). This makes the TypedDict inconsistent with real payloads and will either break type-checking or force unsafe casts. Please either keep capacity as str | None (like the previous definition) or update all producers to always emit a string (e.g., "0") and remove/adjust the None handling accordingly.

Copilot uses AI. Check for mistakes.
@seedspirit seedspirit requested a review from a team April 27, 2026 04:41
Comment thread src/ai/backend/manager/api/gql_legacy/kernel.py
Comment thread src/ai/backend/manager/api/gql_legacy/stat_converter.py Outdated
Comment thread src/ai/backend/manager/api/gql_legacy/stat_converter.py
@jopemachine jopemachine requested a review from a team April 27, 2026 09:27
Comment thread src/ai/backend/manager/api/gql_legacy/kernel.py Outdated
seedspirit and others added 6 commits May 10, 2026 20:44
Reflects review: the converter is stateless, so callers shouldn't
instantiate it on every batch load.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Main moved the type from `manager.data.metric.types` to
`common.clients.prometheus.metric_types`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@seedspirit seedspirit requested a review from a team May 11, 2026 05:19
Comment on lines +21 to +24
Merge order from upstream is gauge -> diff -> rate, so for
RATE/DIFF metrics the same `(name, CURRENT)` tuple appears twice;
`currents[0]` is the raw gauge sample, `currents[-1]` is the
rate/diff query result.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future PR, I plan to refactor the code so that it can be converted from an index-based to a type-based approach. Since this would require modifying the response merge logic, I did not include it in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs Documentations comp:agent Related to Agent component comp:app-proxy Related to App Proxy component comp:client Related to Client component comp:common Related to Common component comp:manager Related to Manager component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants